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Abstract. We study the problem of finding the global Ricmannian center of mass of a set of data 
points on a Ricmannian manifold. Specifically, we investigate the convergence of constant step-size 
gradient descent algorithms for solving this problem. The challenge is that often the underlying cost 
function is neither globally differcntiable nor convex, and despite this one would like to have guaran- 
teed convergence to the global minimizer. After some necessary preparations we state a conjecture 
which we argue is the best convergence condition (in a specific described sense) that one can hope for. 
The conjecture specifies conditions on the spread of the data points, step-size range, and the location 
of the initial condition (i.e., the region of convergence) of the algorithm. These conditions depend 
on the topology and the curvature of the manifold and can be conveniently described in terms of the 
injectivity radius and the sectional curvatures of the manifold. For manifolds of constant nonncgativc 
curvature (e.g., the sphere and the rotation group in R 3 ) we show that the conjecture holds true (we 
do this by proving and using a comparison theorem which seems to be of a different nature from 
the standard comparison theorems in Riemannian geometry). For manifolds of arbitrary curvature 
we prove convergence results which are weaker than the conjectured one (but still superior over the 
available results). We also briefly study the effect of the configuration of the data points on the speed 
of convergence. 

1. Introduction and Outline 

The (global) Riemannian center of mass (a.k.a. Riemannian mean or average or Frechet mean)Q of 
a set data points {xi}f =1 in a Riemannian manifold M is defined as a point x € M which minimizes 
the sum of squares of geodesic distances to the data points. This notion and its variants have a long 
history with several applications in pure mathematics (see e.g., [6], [17], [19], [22] and also [3] for a 
brief history and some new results). More recently, statistical analysis on Ricmannian manifolds and, 
in particular, the Riemannian center of mass have found applications in many applied fields. These 
include fields such as computer vision (see e.g., [36] and [35]), statistical analysis of shapes (see e.g., 
[2"T] . [25] . [8], [15], and [7]), medical imaging (see e.g., [14] and [30]), and sensor networks (see e.g., 
[34] and [33]), and many other general data analysis applications (see e.g., [23], [2], [9], and [29]). In 
these applied settings one often needs to numerically locate or compute the Riemannian center of mass 
of a set of data points lying on a Riemannian manifold, and the so-called (intrinsic) gradient descent 
algorithm has been a popular choice for this purpose. The constant step-size version of the gradient 
descent for finding the Riemannian center of mass is the easiest one to implement, and hence is the 
most popular one. 

The main challenge in analyzing as well as working with the constant step-size gradient descent 
algorithm for finding the Riemannian center of mass is the fact that the underlying cost function is 
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throughout this paper we are mainly interested in the "global" Riemannian center of mass, hence in reference to it 
very often we drop the term "global" and simply use "Riemannian center of mass," etc. If need arises we explicitly use 
the term "local" in reference to a center which is not global (see Definition [275jl . 
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usually neither globally differentiable nor globally convex on the manifoldlfl More specifically, the 
cost function is not differentiable at any cut point of the data points. Moreover, as it can be shown 
by simple examples, the cost function can have local minimizers, which are not of interest and should 
be avoided. Nevertheless, we expect and hope that if the algorithm is initialized close enough to the 
(unknown) global Riemannian center of mass x and the step-size is small enough, then the algorithm 
should converge to the center. One would like to have the step-size small enough such that the cost 
is reduced at each step and at the same time the iterates do not leave a neighborhood around x in 
which x, is the only zero of the gradient of the cost function (recall that a gradient descent algorithm, 
at best, can only locate a zero of the gradient vector field of the cost function). On the other hand, 
one would like to have large enough step-size so that the convergence is fast. The interplay between 
these three constraints is important in determining the conditions guaranteeing convergence as well as 
the speed of convergence. 

The goal of this paper is to give accurate conditions that guarantee convergence of the constant 
step-size gradient algorithm to the global Riemannian center of mass of a set of data points. In Section 
[21 we first briefly give the necessary backgrounds on the Riemannian center of mass and the gradient 
descent algorithm for finding it, these include the notions of convex functions and sets in Subsection 
I2.1.2| differentiability and convexity properties of the Riemannian distance function and bounds on its 
Hessian in Subscction l2.1.31 Riemannian center of mass and its properties in Subsection l2.1.41 a general 
convergence theorem for gradient descent in Subsection 12.1.51 and a convergence theorem estimating 
the speed of convergence and the best step-size in Subsection l2.1.6l Following that, in Subsection 12. 21 
we state Conjecture 12. 151 in which we specify the best "condition for convergence" one can hope for (in 
a sense to be described). Specifically, we specify a bound on the radius of any ball around the data 
points in which the algorithm can be initialized together with an interval of allowable step-size so that 
the algorithm converges to the global center of mass x. The significant point is that for convergence 
the radius of the ball does not need to be any smaller than what ensures existence and uniqueness 
of the center. Moreover, the step-size can be chosen equal to the best (in a specific described sense) 
step-size under the extra assumption that the iterates stay in that ball; and it is conjectured that, 
indeed, the iterates stay in the ball. Knowing the conjecture helps us to compare and evaluate the 
existing results mentioned in Subsection 12.31 as well as the results derived in this paper. In Section [3] 
(Theorem I3.6[) . we prove Conjecture 12.151 for the case of manifolds of constant nonnegative curvature. 
In our developments in this section, we first prove comparison Theorem 13.11 in Subsection 13.11 This 
comparison result (which differs from standard comparison theorems in some aspects) most likely has 
been known among geometers, but we could find neither its statement or a proof for it in the literature. 
In Subsection l3.2l we make sense of an intuitive notion of Riemannian affine convex combination of a set 
of data points in a manifold of nonnegative constant curvature and we explore its relation to the convex 
hull of the data points. These prepare us to prove the main theorem of the section, Theorem 13. 6i in 
Subsection l3.3l Although limited in scope, this result covers some very important manifolds of practical 
interest: § ra the unit sphere in R ra+1 , 50(3) the group of rotations in Mr, and RP n the real projective 
space in R" +1 . In Section 2J for manifolds of arbitrary (i.e., non-constant) curvature, we derive two 
classes of sub-optimal convergence conditions (in comparison with Conjecture 12. 151) : In Subsection 03] 
we give a result (Theorem 14. ip in which convergence is guaranteed at the expense of smaller spread 
of data points, whereas in Subsection 14.21 (Theorem I4.2j) the allowable step-size is compromised to 
guarantee convergence. Finally, in Section [5l we qualitatively identify data configurations for which 



For us global differentiability means differentiability everywhere on the manifold; however, we use the term "global" 
to remind ourselves that our functions of interest (e.g., the Riemannian distance from a point) lose differentiability at 
far away distances. 

^In fact, it is well known that on a compact Riemannian manifold the only globally continuous convex functions are 
constant functions (see Theorem 12. 21 and 39 ). 

^This step-size, in general, depends on an upper bound on the sectional curvatures of the manifold and the radius of 
the mentioned ball. However, interestingly, for a manifold of non-negative curvature it is simply 1 (see Conjecture 12.151 
and Remark 12.161 . 

5 The main challenge in proving this conjecture is to prove that the iterates stay in the ball containing the data points. 
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the convergence is very fast or very slow. We conclude the paper in Section |6] by giving some remarks 
about further research. 

2. Preliminaries, a conjecture, and prior work 

2.1. Preliminaries on the Riemannian Center of Mass and the Gradient Descent Algo- 
rithm. 

2.1.1. Notations. Let M be an n-dimensional complet<3 Riemannian manifold with distance function 
We denote the tangent space at x e M by T X M and by (•,•} and || • || we mean the Riemannian 

structure and the corresponding norm, respectively (dependence on the base point is implicit and clear 
from the context). By a C k function in a subset of M we mean a continuous function which is k th 
order continuously differentiable in the subset as commonly understood in differential geometry. For 
a function / : M -*■ R, V/ denotes its gradient vector field with respect to (•,•). We assume that the 
sectional curvatures of M are bounded from above and below by A and 5, respectively. The exponential 
map of M at x e M is denoted by exp a .(-) : T x -*■ M and its inverse (wherever defined) is denoted by 
exp~ (•). The injectivity radius of M is denoted by injM > 0. An open ball with center o e M and 
radius p is denoted by B(o,p) and its closure by B(o,p). 

2.1.2. Convex functions and sets in Riemannian manifolds. Convexity plays an important rule in our 
developments and we have the following definition: 

Definition 2.1. Let A be an open subset of M such that every two points in A can be connected 
by at least one geodesic of M such that this geodesic lies entirely in A. Assume that / : A -* R is a 
continuous function. Then / is called (strictly) convex if the composition / o 7 : [0, 1] -*■ R is (strictly) 
convex for any geodesic 7 : [0, 1] A. We say that / is globally (strictly) convex if it is (strictly) 
convex in M. 

If / is C 2 in A, then convexity (strict convexity) of / is equivalent to 5^2 /(7(f) )|t=o ^ (> 0), where 
7 : [0, 1] -> A is any geodesic in A. 

An insightful fact is the following [39J : 

Theorem 2.2. The only globally convex function in a compact Riemannian manifold is a constant 
function. 

It is more convenient to limit ourselves to strongly convex subsets of M because they are quite 
similar to standard convex sets in R": 

Definition 2.3. A set A c M is called strongly convex if any two points in A can be connected by a 
unique minimizing geodesic in M and the geodesic segment lies entirely in A. 

Define 

1 7T 

(2-1) r cx = - minjinjM, y=}> 

with the convention that = 00 for A < 0. An open ball B(o,p) with p < r cx is strongly convex in 

M; the same holds for any closed ball B(o,p) if p < r cx (see e.g., [32l p. 404] and [TTJ pp. 168-9]). In 
fact, B(o, p) with p < r cx is even more similar to a convex set in Euclidean space: for any x, y e B(o, p) 
the minimal geodesic between from x to y is the only geodesic connecting them which lies entirely in 
B(o,p). 



Our results are local in nature, but they hold in relatively large regions that are determined explicitly. Therefore, 
completeness of the manifold is not necessary and our results could be easily adapted to e.g., non-singular regions of a 
singular manifold. However, for this purpose the statements of our results could become rather cumbersome. 
7 In our definitions relating to Riemannian manifolds we mainly follow 1321 . 
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2.1.3. Differentiability and convexity properties of the distance function and estimates on its Hessian. 
Now, we briefly give some facts about the Riemannian distance function which will be used throughout 
the paper. Let y e M . The function x i-> d(x, y) is continuous for every x e M . However, it is 
diffcrentiable (in fact C°°) in M \ ({y} uC y ), where C y is the cut locus of y (see e.g., [32j pp. 108-110]). 
We recall the notion of cut locus: Let D y e T X M be the largest domain containing the origin of T y M 
in which exp y : T y -> M is a diffcomorphism, and let C y be the boundary of D y . Then C y = cxp y (C y ) 
is called the cut locus of y and one has M = cxp y (D y u C y ) (see e.g., [TTJ pp. 117-118] or [32j p. 
104]). The distance between y and C y is called the injectivity radius of y denoted by injy, and by 
definition injA/ = inf ye M injy. It is well known that C y has measure zero in M. On the unit sphere § n 
the cut locus of every point is its antipode. In a general manifold M the differentiability property of 
x 1-+ d(x, y) at x = y is similar to the case where M = R™ equipped with the standard Euclidean metric; 
in particular, x >-*■ ^d 2 (x,y) is a C°° function in M ^C y . However, the behavior at far away points 
(e.g., the cut locus) is of substantially different nature and depends on the topology and curvature of 
M (recall that in Euclidean space the cut locus of every point is empty) . 

Next, we recall some useful estimates about the Hessian of the Riemannian distance function. We 
adopt the following definitions: 



(2.2) sn K (Z) 



and 



-^sin(V«0 k>0 (- ^cot(v7s0 k>0 



1 *-n _x i 



ct K (/)= 7 K = 



-7=sinh(yH0 k<0 [ coth(y/\n\l) k<0 



bK ^ j_i i k<o c «W-i ^cothCvP) K <o. 



Assume that x e M is such that d(x,y) < minjinjy, "^}o Furthermore, assume that t >->■ j(t) with 
7(0) = x is a unit speed geodesic making, at 21, an angle /3 with the minimal geodesic from y to x. It 
can be proved that (see e.g., [32l pp. 152-154]) 

d 2 

(2.4) ct A (d(a:, y)) sin 2 /3 < _d( 7 (t), y)| 4=Q < ct a (d(a:, y)) sin 2 /3, 
where ct K is defined in (|2.2p . Based on the above one can verify that 

(2.5) h A (d(x,y)) < ^(id 2 ( 7 (t),y))| t=o < c 5 (d(x,y)), 
and more generally that 

(2.6) ^- 2 (x,y)min{p-l,b A (d(.T,y))}<|^ 

Notice that the above confirms the intuition that the differentiability behavior of x ^d p (x,y) at 
y = x is the same in M and in K n . We will use these two relations very often; in doing so it is useful 
to have in mind that b K (l) and c K (l), respectively, are decreasing and increasing in I. 

Remark 2.4. We emphasize that the requirement d(x, y) < injy is essential and cannot be compromised, 
as at a cut point the distance function becomes non-differcntiable. Although in the left hand side of 
the above bounds only A appears explicitly both the curvature and topology of M determine the 
convexity properties of the distance function. Notice that A gives (some) information about the 
Riemannian curvature tensor of M and injy (or injM) gives (some) information about the topology of 
M. For example, R and the unit circle S 1 both have zero sectional curvature, while injS 1 = n and the 
injectivity radius of R is infinity. Now obviously x >-* ^d 2 (x,y) in R is globally convex, while in S 1 it is 
not (as seen directly or as a consequence of Theorem 12. 2|) . The interesting fact is that x >-*■ |d 2 (a;,y) 



^Instead of this condition, it is often more convenient to require d(x, y) < minjinjM, -p= }, which is a more conservative 
yet global version of the condition. 
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has positive definite Hessian in S 1 \ {y'}, where y' is the antipodal point of y. At y', x >-*■ ^d 2 (x,y) 
is not diffcrcntiable. This non-differentiability is so severe that in spite of the positivity of its Hessian 
at all other points x >->• ^d 2 (x, y) is far from a globally convex function on S 1 . As an example of a less 
severe non-differentiability notice that x d(x,y) in R is also non-differentiable at x = y, yet this does 
not affect its convexity. 

2.1.4. Riemannian LP center of mass. We start by the following definition: 

Definition 2.5. The (global) the Riemannian LP center of mass or mean (a.k.a. Frechet mean) of the 
data set {xi}^ c M with respect to weights < Wi < 1 (£i=i Wi = 1) is defined as the minimizer(s) of 

(2.7) f p (x) = { jXli^i^i) ^P<°° 



m&Xid(x,Xi) p 



oo. 



in M. We denote the center by x p . We call a local minimizer of f p (which is not global) a local center 
of mass of the data set with respect to the weights^ 

The reader is referred to [3J for details and other related definitions. As a convention when referring 
to the center of mass of some data points we usually do not refer to explicit weights unless needed. As 
another convention when p is not specified we assume p = 2, which is the most commonly used case. 
Although p = 1 and p = oo are also used often, in this paper our focus is limited to 2 < p < oo0 The 
reason is that in our analysis we require f p to be twice-continuously diffcrcntiable and we determine the 
constant step-size of the gradient algorithm in terms of the upper bounds on the eigenvalues Hessian 
of f p . As in Euclidean case, in the more general Riemannian case also one can see from (|2.6p that for 

1 < p < 2 the Hessian of f p might be unbounded. It is well known that Lipschitz continuous Hessian 
(in particular bounded Hessian) is necessary for the convergence of a gradient descent method with 
constant-step size [51] . 

Although f p : M ->■ R is a globally convex function when M is a Euclidean space (or more generally 
an Hadamard manifold), fi is not globally convex for an arbitrary M. In particular, the center of 
mass might not be unique; however, if the data points are close enough, then the center is unique. The 
following theorem gives sufficient conditions for existence and uniqueness of the Riemannian center of 
mass. 

Theorem 2.6. Let {xi}f =l c B(o,p) and assume < Wi < 1 with u>,: = 1. For p > 2, if p < r cx , 
then the Riemannian LP center of mass x p is unique and is inside B(o,p). For 2 < p < oo it is the 
unique zero of the gradient vector field V/ p in B(o,p), and moreover if no data point has weight 1, 
then x p is a non-degenerate critical point of f p (i.e., the Hessian of f p at x p is positive-definite). F^l 

For a proof see [3J. Also for 1 < p < 2 see [3J and [35]. In this paper, we are mainly interested in 

2 < p < oo, because the bounds we derive on the step-size of the gradient descent algorithm depend on 



9 A local minimizer of f p in M is sometimes called a Karcher mean, although this definition bears little relation with 
the way Grove and Karcher [17| or Karcher [19] originally defined what they called the Riemannian center of mass (see 
also [3] for more details). Given c B{o, p) with small enough p those authors defined the center of mass as a zero 

of V/2 in B(o,p) or alternatively as a local minimizer of /2 in B(o,p) (Notice that given that /2 is not diffcrcntiable 
at the cut locus of each data point it is not a-priori clear that, in general, a local minimizer of /2 in M should coincide 
with a zero of V/2, although there are some evidence in the literature that this situation might not happen, see e.g,. 
[18] . [10] 1 . Overall, there is no consensus among authors about the terminology and we find it more convenient to use 
"local" and "global" Riemannian center of mass as defined here. 

^Going from p = 2to2<p<oo does not introduce any major technical challenge, so our analysis can be applied 
almost uniformly to 2 < p < 00. However, since the results for p = 2 are easier to state and presumably used more often, 
we usually state a result for p = 2 and then give the more general version for 2 < p < 00. The only exception is Theorem 
!4.2l for which we do not give a 2 < p < 00 version. 

^In Theorem 2.1 of [3], the condition on p is stated as p < r cx , but since we have finite number of data points the 
current version follows immediately. Also from the statement of Theorem 2.1, x p is the only zero of V/p in B(o, p), but 
from the proof of the theorem it can be seen that the vector field -V/p is inward-pointing on the boundary of B(o, p) and 
hence x p is the only zero in entire B(o, p). In addition, the fact about non-degeneracy is not present in the statement 
of Theorem 2.1 of [3], however it is proved in the proof of the theorem. 
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an upper bound on the eigenvalues of the Hessian of f p , and for 1 < p < 2 (and p = oo) the Hessian is 
unbounded. We refer the reader to [35] and [3] for algorithms in the case of p = 1 and p = oo. 

A rather subtle issue is that according to Theorem l2.61 if p < r cx , then f p has a unique minimizer in 
B(o,p). Nevertheless, / p |b(o,p); the restriction of f p to B(o,p), might not be a convex function. More 
accurately, if p < | minjinjM, (cf. (|2.1jl ). then f p \s{o,p) is a convex function (this can be seen 

easily from (|2.5[) by noting that b&(x) > for x e B(o,p)). However, if A > and -^7= < p < r cx , then 
fp\B(o,p) might not be a convex function, as can be seen by very simple examples. While this fact does 
not harm implementation and convergence of a gradient descent algorithm for finding x p greatly, it 
has some serious implications on implementation and applicability of Newton's method (see Remarks 
Cm and OH). 

2.1.5. Gradient descent algorithm for finding the Riemannian center of mass. For later reference we 
derive the intrinsic gradient descent algorithm for locating the Riemannian LP center of mass (see pQ 
or |37) for introduction to optimization on Riemannian manifolds). One can check that 

N 

(2.8) yfp(x) = -Y, w id P ~ 2 (x,Xi) exp' 1 x i} 

i=i 

for any x e M as long as it is not in the cut locus of any of the data points. In particular, if 
{Xi}^ =1 c B(o,p), where p < then for any x e B(o,p) the above expression is well-defined in the 

classical sense (i.e., it is uniquely defined). Notice that the above expression is well-defined for almost 
every x e M because the set at which f p is not diffcrentiable has measure zero (for p > 1 this set is 
UiC Xi and for p = 1 it is UiC Xi u {xi}i). As we will see this non-differentiability has severe implications 
on the behavior of the constant step-size gradient descent. 

Algorithm [1] in Table Q] is a gradient descent algorithm for locating the Riemannian center of mass 
of {xi}fii. Besides practical considerations (e.g., stopping criterion), at least two important issues are 

Algorithm 1: Gradient Descent for finding the Riemannian LP center of mass 

(1) Consider {xi}f =l c B(o,p) c M and weights {wi}^ and 
choose x° e M. 

(2) if Vfp(x k ) = then stop, else set 

(2.9) x k+1 =cx Pxk (-t k Vf p (x k )) 

where t k > is an "appropriate" step-size and V/ p (-) is 
defined in (f2~8|) . 

(3) goto step [2] 

Table 1. Gradient descent for finding the Riemannian L p center of mass. 



left unspecified in Algorithm [Q namely, how to choose x° and how to choose t k for each k. The most 
natural choice for x° is one point in B(o,p), say one of the data points. Note that in practice o and 
the exact value of p might not be known. 

The choice of t k is more complicated. The next general proposition gives a prescription for a step- 
size interval which ensures reducing the cost function at an iteration of a gradient descent provided 
one knows an upper bound on the eigenvalues of the Hessian of the cost function in a region which 
iterates live. The proof of this proposition follows from the second order Taylor series expansion (with 
remainder). 

Proposition 2.7. Let x e M and consider an open neighborhood S c M containing x. Let f ■ M -*■ R 

be a function whose restriction to S is twice- continuously differ entiable and let the real number Hs be 
an upper bound on the eigenvalues of the Hessian of f in S . There exists t x ^s > such that for all 
t e [0,t Xt s) the curve t >->• exp x (— 4V/(x)) does not leave S and 

(2.10) /(exp a (-tV/(x))) < f(x) - \\Vf(x)ft + Hsl V f 
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For t e (0, min{t Xt s, with the convention that -g- = +oo for Hs < 0, we have f (exp x (-t\7 f (x))) < 

f(x) with equality only if x is a critical point of f . Moreover, when Hs > the right hand side of 
V2.10]) is minimized for t = -g- . 

Notice that the fact that for t e [0,t Xt s) the curve t i-* exp x (-tV f (x)) stays in S is crucial in 
enabling us to use the upper bound Hs and derive (|2.10|) . This concept appears frequently in our 
analysis and it useful to have the following definition: 

Definition 2.8. Let x k e S c M. We say that iterate x k+1 of Algorithm [1] stays in S if x k+1 = 
exp x k(-tkV f p (x k )) e S. We say that the iterate x k+1 of Algorithm [T] continuously stays in S if 
(sVf p (x k ))eS for se[0,t k ]. 

Obviously, continuously staying in S is stronger than staying in S. However, they are equivalent 
under some conditions (which hold in some, but not all, of the cases we study, see Remark l2.18[) : 

Proposition 2.9. If S is a strongly convex set and tk || Vf p (x k ) \\ < injM for every k>0, then for the 
iterates of Algorithm]]] staying in S implies (and hence is equivalent to) continuously staying in S. 

Proof. Assume that x k and x k+1 both belong to S. Recall that t >->■ exp x (-tV f p (x k )) for t e [0,tfc] is a 
minimizing geodesic if tt\\ V/ P (x fc )|| < injM. Therefore, by strong convexity of S, t exp x (-t\7 f p (x k )) 
for t € [0,ifc] must be the only minimizing geodesic between x k and x k+1 and must lie in S entirely. □ 

As mentioned before we are only interested in constant step-size gradient descent. The following 
convergence result is standard for this type of algorithms when the cost is C 2 (or at least has Lipschitz 
gradient) and is globally convex; however, our version is adapted to f p which, in general, is neither 
globally C 2 nor convex. The assumption of the theorem that each iterate of the algorithm continuously 
stay in a neighborhood S of x p , in which is x p is the only zero of V/ p , is a crucial enabling ingredient 
of the proof. In fact, our goal in Sections [3] and |4] is essentially to identify such a neighborhood (under 
certain conditions). 

Theorem 2.10. Let 2 < p < oo and assume that x p is the center of mass of {xi\f =1 c B(o,p), where 
p< r cx . Let S be a bounded open neighborhood of x p such that f p is C 2 in S and C 1 in S, the closure 
of S. Furthermore, assume that x p is the only zero of the vector field V f p in S . Let Hs be an upper 
bound on the eigenvalues of the Hessian of f p in S. In Algorithm^ choose t^ = t e (0,-^). If starting 
from x° € S , each iterate of Algorithm^ continuously stays in S , then f p (x k+1 ) < f p (x k ) for k > with 
equality only if x k = x p and x k converges to x p . 

Proof. Since by assumption x k e S for k > and S is compact, there is a subsequence (x kj )k j converging 
to a point x* e S. By Proposition 12 . 71 we have f p (x k+1 ) < f p (x k ) unless x k = x p and furthermore 

(2.H) t{\ - ^f) t |V/ P (^)|| 2 < f p (x k+1 ) - f p (x°), 

for every k > 0. Since {f P (x k ))k is a bounded sequence, the above implies that ||V/p(x fc )|| -*■ 0; hence, 
by continuity of V f p we have ||V/ p (a;*)|| = 0, that is, x* is a zero of V/ p in S. But by the assumption 
about S this means that x* coincides with x p and therefore x k -> x p . □ 

Next, we give a very simple but insightful example. 

Example 2.11 (Finding the mean of two points on the unit circle S 1 ). Let M be the unit circle S 1 
centered at the origin (0,0) and equipped with the standard arc length distance d. Recall that A = 
and injM = 7T. Let o denote the point (1,0) (see Figured]). We consider two data points x\,X2 e S 1 
represented as xi = (cos8i, sin#i) (i = 1,2) where < Q\ < p < f and 82 = -6\. We specify the weights 
and 9i later. Under the mentioned assumption that p < f, Theorem 12.61 guarantees that the center of 
mass x is unique and in fact one can see that x = (cosO, sin#) where 8 = wxOi +W282. More importantly, 
it also follows that x is the unique zero of V/2 in B(o, p) (as well as B(o, ^)). Notice that f% is smooth 
within B(o,ir - 81) (it does not follow from Theorem 12.61 but it is an easily verifiable fact that x is 
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the unique zero of V/2 in B(o,ir - 61)). On the other hand, f 2 is not differentiable at x{ and x' 2 , the 
antipodal points of x\ and x 2 , respectively. Furthermore, in S 1 \ {x^x^} the Hessian of f 2 is defined 
and is equal to 1, hence the largest possible range of the constant step-size tk = t is the interval (0, 2). 
Next, we see under what conditions Theorem l2. 101 applies. It is easy to check that, independent of the 
weights, with a; e B(o,p) and step-size tk = t e (0,1] the iterates of Algorithm [T] continuously stay in 
B(o,p). Therefore, an acceptable S is the ball B(o,p) and Theorem 12.101 ensures convergence to the 
global center x provided step-size t is in the interval (0, 1]. This result is essentially not different from 
what we have in K. However, the situation for t e (1,2) is rather subtle since with a large step-size the 
iterates might leave the ball B(o,p) or even B(o,tt - 0i) and enter a region in which there is another 
zero of V/2- To be specific notice that f 2 can be parameterized with e (-ir, +ir] as 

. f Wi(0-01 - 2lT) 2 + W 2 (6 - 6 2 ) 2 -7T<0<0 1 -7T 

(2.12) f 2 (9) = -\ Wl {6 - 0i) 2 + w 2 (9 - 9 2 f 1 -^<0<0 2 + tt 

2 [ Wl(6 - 0l) 2 + W 2 (9 - 2 + 27r) 2 2 + 7T < 9 < IT. 

Now, let us fix 0i = ^ (and 02 = -Tf)- First, let w\ = ^ and w 2 = y^. The solid curve in the right 
panel in Figure Q] shows the graph of /2(0)- The two cranks in the curve at 9[ = Z | 2L and 9' 2 = *f are 
due to the non-differentiability of f 2 at antipodal points of x\ and X2- If we run Algorithm Q] with 
x° = x\ and step-size t = || we have x 1 = x[, thus x 1 coincides with a non-differentiable critical point 
of /2 at which the algorithm is, in fact, not well-defined. In the generic setting the probability of 
this happening is zero; however, for larger t, x 1 will leave B(o,ir - 0i). It can be seen that for this 
specific pair of weights V/2 has only one zero in S . Consequently, in practice, Algorithm [T] for almost 
every initial condition in S 1 and step-size tk = t in the interval (0, 2) will find the global center of mass 
x = (cos 0, sin 0), where = = ^jr- (this fact does not follow from Theorem 12.101 but is not difficult to 
verify in this special case, see also Remark l2.12|) . But we might not be this lucky always! For example, 
let wi = 2 and w 2 = |. The dashed curve in FigurcQ]show /2(0) for this pair of weights. One can verify 
that in addition to the global minimizer = this time, /2(0) has a local minimizer at 0' = Now 
if we run Algorithm [T] with x° = x\ and constant step-size tk = t where t = then we have x 1 = -j^-, 
i.e., the next iterate coincides with the local center x' = (cos 0', sin 0') and the algorithm gets stuck at 
the wrong center! For values of t slightly smaller or larger than the algorithm still converges to x' . 
Notice that this happens despite the fact that the cost is reduced at each iterationJ3 Although this 
simple example does not show the effect of the curvature of the manifold, still it makes it clear that 
in order for Algorithm [T] to have a predictable behavior that is as data-independent as possible it is 
important to identify conditions under which the assumptions of Theorem 12.101 are satisfied (mainly 
that the iterates continuously stay in a candidate S). Our efforts in the following sections are primarily 
directed toward finding a set S, the radius p of a ball containing the data points, and the step-size 
range which guarantee that the iterates continuously stay in S. 

Remark 2.12 (Local vs. global behavior). Our focus in this paper is on finding the global center of 
mass using a constant step-size gradient descent algorithm; hence, we only study the local behavior 
of the algorithm (albeit in relatively large domains), and we leave the global behavior analysis of the 
algorithm to further research. In this regard, one particular important question is: "How (if possible 
at all) could one guarantee that for almost every initial condition in M the algorithm converges to 
a local Riemannian center of mass?" Notice that this is a relevant question, in particular, since the 
underling cost function is not differentiable globally and e.g., a-priori one could not rule out a possible 
oscillatory behavior. 

2.1.6. Speed of convergence and the best step-size. In Proposition 12. 7\ t = -g- is the best step-size in 
the sense that in each iteration it gives the most reduction in the upper bound of f p (x k+1 ) described 
in the right hand side of (|2.10[) . The following theorem relates this choice to the speed of convergence 

12 It would be interesting to see whether an example (in S 1 or another manifold) exists in which due to the non- 
diffcrcntiability of /2, we have /2(a ) > f2(%°) if x 1 does not continuously stay in S. Such a phenomenon could lead to 
an oscillatory behavior (see Remark l2.12H . 
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Figure 1 . The left panel shows the configuration of data points x\ and X2 in Example 
12.111 and the right panel is the graph of fzid) for two different pairs of weights. The 
function f 2 is non-diffcrentiable at x[ and x' 2 , the antipodal points of X\ and x 2 . 
The example shows that if the step-size is large the iterates might leave the region in 
which / 2 is smooth and the algorithm might converge to i', a local center of X\ and 
X2, instead of the global center x (despite the fact that the cost is reduced at each 
step) . Notice that the correct way of thinking about the plotted graphs it to visualize 
them while identifying points 9 = —n and 8 = +tt or to think of them as periodic graphs 
with period 2tt. 



of the algorithm. The proof of the theorem is adopted from [37J P- 266, Theorem 4.2] where a proof 
is given for a globally convex C 2 function. Here, we adapt that proof to constant step-size gradient 
descent for minimizing f p (which is only locally C 2 and convex) rl 

Theorem 2.13. Set 2 < p < oo. Let x p be the IP center of mass of {xi}f =l c B(o,p) c M, where 
p < r cx . Suppose that S is a strongly convex neighborhood around x p in which f p is twice- continuously 
differentiable, and let hs and Hs, respectively, denote a lower and upper bound on the eigenvalues of 
the Hessian of f p in S. Furthermore, assume that S is small enough such that one can choose hs > 0. 
In Algorithm^ choose a constant step-size tf. = t e (0, j 2 ^)- If after a finite number of iterations k! 
each iterate continuously stays in S, then for k > k' we have 

(2.13) d{x k ,x p )<Kq k ^ L . 
In the above K and q are defined as 

(2.14) K J^ k ')-U^))Y ar ^ =1 _ a(1 _«)^(l + ^), 
y ' \ h s ) V 2>H S K H s h 

where a = tHs ■ In particular, < q < 1 and x k -*■ x p as k -> oo . 

Proof. Let 7(£) = exp x (t exp^ 1 x p ) be the minimal geodesic from x e S to x p . Note that ^(t) € S for 
t e [0, 1] due to strong convexity of S. After writing the second order Taylor's series of t h> f P ("f(t)) 
around t = and using the bounds on the Hessian of f p one gets 

hs Hs 

(2.15) - d(x,x p )\\X7f p (x)\\ + —d 2 (x,x p ) < f p (x p ) - f p (x) < d(x,x p )\\V f p (x)\\ + —d 2 (x,x p ). 



It should be clear that nothing is special about f p and x p in Theorems 12.101 and 12.131 Both theorems hold true 
if f p is replaced with any function / : M -> R which is twice-continuously differentiable in S satisfying the respective 
assumptions of the theorems and x p is replaced with a non-degenerate local minimizer of / (in fact, for Theorem 12. 101 
an isolated local minimizer suffices). 
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Similarly by expansion of t >-»■ f p (j(t)) around t = 1 one gets 

(2.16) ^-d 2 (x,x p ) < f p (x) - fp(xp) < ^-d 2 {x,x p ) 

Also notice that by the first order Taylor series expansion of t >-»• V fp(j(t)) around t = 1 we have 

(2.17) h s d(x,x p ) < \\Vf p (x)\\ < H s d(x,x p ) 
From the left inequality in (|2.15[) and the right inequality in (|2.16[) we have 

(2.18) f p (x) - fp(xp) < d(x,x p )\\X7fp(x)\\ - hjL(f p ( x )-f p ( Xp )). 

Hs 

Combining this with the left inequality in (|2.17[) results in 

(2-19) Ml + ^)(f P (x) ~ Wp)) * |V/ p (x)|| 2 . 

Now assume k > k! so x k ,x k+1 e S. Then subtracting / p (i p ) from both sides of (|2.10l) and using (|2.18l) 
both at x = x k yield 

(2.20) f P (x k+1 ) ~ fp(x P ) < q(f P (x k ) - fp(xp)). 
Therefore, we have 

(2.21) f p (x k ) - f p (xp) < (f p (x k ') - fp(x p ))q k - k ' , 

for k> k'. Now combining this with the left inequality in (|2.16[) yields (|2.13j) . □ 

This theorem predicts a lower bound on the speed of convergence (i.e., the actual convergence is 
not worst than what the theorem predicts). The accuracy of this prediction, in part, depends on the 
accuracy of our estimates of the lower and upper bounds on the eigenvalues of the Hessian. Observe 
that when we are only given Hs, a = 1 (i.e., tfc = j^) gives the smallest a-priori q. We call t% = 
the best a-priori step-size given Hs- 

Remark 2.14 (Asymptotic q). Notice that a strongly convex S which works in Theorem 12.131 might 
not work in Theorem 12.101 (since the assumption hs > might not hold true) and a smaller S might 
be needed for this theorem. Nevertheless, if we start with an S (and a corresponding Hs) for which 
Thcorcni l2.10l holds. then there exists a smaller strongly convex S"(c S) for which hs> > and Theorem 
12.131 holds. Hs is still an upper bound on the eigenvalues of the Hessian of f p in S' and in lack of 
any knowledge about S' still t = jj- is the best a-priori step-size. The actual asymptotic speed of 
convergence is determined by q in a very small neighborhood S' of x p . In fact, in the limit hs 1 and 
Hs> converge, respectively, to A^ in and A^ ax the smallest and largest eigenvalues of the Hessian of f p 
at x p . Therefore, for any step-size t e (0, y- — ), we define the associated asymptotic q, denoted by q' , 
where in (|2.14[) . hs, Hs, and a are replaced, respectively, by X' min , A^^, and a' = t\' max . Notice that 
if t = -jjr-, then a' < 1 and the smaller the Hs the smaller the q' will be. In general, "smaller Hs" 
means that either we make S smaller or we choose a more accurate upper bound on the eigenvalues of 
the Hessian of f p in S. 

2.2. A Conjecture: The Best Convergence Condition. As mentioned before, reduction of the 
cost at each iteration is not enough to guarantee the convergence of Algorithm [T] to the global center of 
mass. Nevertheless, we conjecture that if the constant step-size is chosen not too large and the initial 
condition is not far from x p (as specified next), then the cost f p can be reduced at each iteration, the 
iterates stay close to x p and converge to it . 

Conjecture 2.15. Set p = 2 and let X2 be the L 2 center of mass of {xi}^ =1 c B(o,p) c M where 
p < r cx . Let i?s( Oj|0 ) = 0,5(2,0), where c K is defined in 112.3]) . In Algorithm^ assume x° € B(o,p) 
and choose a constant step-size t^ = t, for some t e (0, — ]■ Then we have the following: Each 
iterate continuously stays in B(o,p) (and hence the algorithm will be well-defined for every k > 0), 
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/2(x fc+1 ) < f2(% k ) (k > 0) with equality only if x k = X2, and x k -* Xi as k ->■ 00. More generally, for 
2 < p < 00 the same results hold if t e (0, -77 — ], where Hbi p) p = (2/3) p_2 max{p - 1,0,5(2/?)}. 

Now we explain the sense in which this conjecture is the best result one can hope for. We narrow 
down our desired class of convergence conditions to a class which gives conditions that are uniform 
in the data sets and in the initial condition. More specifically, we consider the following general and 
natural class of conditions: 

Convergence Condition Class (C): Consider Algorithm [T] and fix 2 < p < 00, and let S 
and A, respectively, be a lower and upper bound on sectional curvatures of M. Specify 
the largest p where < p < r cx such that for every p < p there are 

(1) a number p' SAiP>p (p < Ps.a, p , p < r cx) depending only on S, A, p, and p; and 

(2) another number ts t A.p.p',p depending only on 5, A, p, p, and p' s A , 

for which the following holds: for every ball B(o, p) c M, for every set of data points in 
B(o,p), for every set of weights in (|2.7j) . for every initial condition in B(o,p), and for 
every constant step-size tk = te (0,ts,A,p,p',p]i each iterate of Algorithm [T] continuously 
stays in B(o,p' SApp ) and x k ->■ x p . 

First, notice that with p > r cx there is no hope to have convergence to the global center (in this 
class), since the global center might not in B(o,p) and in general V/ p might have more than one zero 
in B(o,p). Next, observe that Conjecture 12.151 belongs to this class of conditions and it claims that 
p = r cx is achievable; therefore, in this sense the conjecture claims the best possible condition in this 
class. In particular, this means that the conjecture gives the best possible spread of data points and 
the largest region of convergence, i.e., it allows both {xi}f =1 c B(o,r cx ) and x° e B(o,r cx ). 

Now let us see in what sense the step-size interval in Conjecture 12 . 1 51 is optimal. One can verify that 
Hb(o,p),p m Conjecture 12. 151 is the smallest uniform upper bound on the eigenvalues of the Hessian of 
f p in B(o, p). Here, by a uniform bound we mean a bound which is independent of the data points, the 
weights, and o. Based on Remark l2.14[ if we know that each iterate continuously stays in B{o, p' s A ) 
and further if we only know H B ^ o p ^ p , then from Theorem 1 2 . 1 31 we see that tk = jj — — - — is the best 

uniform a-priori step-size (in the sense that it yields the smallest uniform a-priori q). Next, notice that, 
with this tk, the smaller the p', the smaller the Hb( ,p'). p and hence the smaller the q' (the asymptotic 
q) will be, see Remark 12.141 Consequently, tk = -77 — at p' = p gives the smallest asymptotic q 

among all best uniform a-priori step-sizes tk = ~s — , where p < p' < r cx . Conjecture 12.151 claims 

n B{o.p').p 

that, indeed, p'g^pp can ^ e as smau as P (independent of 8,A.,p). Therefore, in summary, among 

all conditions in class (C), the sub-class which prescribes t$ a p p> p = 77 allows to have the 

B( -"' f ''e,A,p,p hP 

smallest uniform a-priori q (by choosing tk = -77 ), and in this sub-class, Conjecture 12.151 

gives the largest ts.A,p,p',p, therefore it allows for the largest step-size and hence the smallest uniform 
asymptotic q in this sub-class. We stress that this sense of optimality of the step-size interval should 
not be construed as giving the best speed of convergence for any actual data configuration; rather it 
gives the best uniform lower bound on the speed of convergence in Theorem 12.131 This means that 
for all data configurations, weights, and initial conditions in B(o,p) the actual speed of convergence 
will not be worse than the one predicted by Theorem 12.131 where the associated q in (|2.14p is the best 
uniform a-priori q. Quite similarly, one could argue that the step-size interval in Conjecture 12.151 is 
optimal in the sense that allows for the most reduction per iteration in the upper bound given in (|2.10l) 
of Proposition 12.71 

The proof of the conjecture in the case of manifolds with zero curvature is quite easy and straight- 
forward. However, the general case seems to be difficult. The main difficulty in proving Conjecture 
12 . 1 51 is in proving that the iterates continuously stay in B(o,p). Nevertheless, in Theorem l3.6l we prove 
the conjecture for manifolds of constant nonnegativc curvature. As this proof suggests, we believe 
that the difficulty in proving this conjecture has more to do with geometry (than optimization) and 
the need of good estimates (which currently seem not to exist) about the behavior of the exponential 
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map in a manifold. Our proof of Theorem 13.61 certainly constitutes some strong evidence that the 
conjecture also is true for manifolds of nonnegative curvature. For manifolds with negative curvature 
we have also some evidence that the conjecture is true. For example, the conjecture is trivially true 
if all the data points are concentrated at a single point in B(o,p), and by continuity, it is also true 
if all the data points are concentrated enough around a single point in B(o,p). Weaker convergence 
results can be established with some efforts. For example, in Section 01 we derive weaker convergence 
results in Theorems 14.11 and 14.21 As a comparison, Theorem 14.11 gives smaller allowable spread and 
smaller region of convergence than Conjecture 12.151 Theorem 14.21 on the other hand, gives allowable 
spread and region of convergence which could be very close to -B(o, r cx ), but the allowable step-size is 
restricted significantly in this theorem. 

Remark 2.16. It is interesting to note that except for the L 2 center of mass in a manifold of nonnegative 
curvature, in all other cases the best step-size depends on p and the radius p, where the latter in a typical 
scenario is not often known a-priori or at least its estimation requires further processing. Moreover, for 
manifolds of negative curvature a lower bound on the curvature is also needed to determine the best 
step-size. Fortunately, the class of manifolds with nonnegative curvature already covers a big bulk of 
applications, and since these manifolds are usually compact an a-priori upper bound on p (e.g., the 
diameter of the manifold or r cx ) can be used to determine a step-size (obviously, not the best one) for 
values of p other than 2. 

Remark 2.17 (Related to Remark 12. 1 4 j) . It is useful to put this conjecture in some context, especially in 
view of Theorems I2.10l and l2.13l and Remark l2.14l It should be clear from our discussions in Subsection 
12.1.41 and Remark 12.141 that when A > 0, the conjecture claims convergence for initial conditions in 
regions in which the Hessian of f p is not necessarily positive-definite. Therefore, what really could 
help us in proving this result is Theorem 12.101 and not Theorem 12.131 Although, already assured of 
convergence, we can use Theorem 12.131 to give us an asymptotic behavior of the algorithm. All our 
proved convergence theorems (which are Theorems 13. 6[ 14. 1\ and I4.2j) are proved using Theorem 12.101 
In Theorems 13.61 and 14.11 both the initial conditions and the trajectories of the algorithm can lie in 
regions in which only this theorem applies. The case of Theorem 14. II is rather interesting. Under the 
conditions of Theorem 14.11 the initial condition must lie in a region in which f p happen to be strictly 
convex (since -|r cx < j^f^, see Subsection 12. 1.4|) . however, the trajectory of the algorithm can visit a 
region in which the Hessian of f p is not positive-definite. 

Remark 2.18. From (|2.8|) we have ||V/ p (x)| < (2p) p ^ 1 for x e B(o,p), and we can easily verify that 
Hb{o, p ).p ^ (2/o) p ~ 2 - Consequently, for each x k we have tk || Vf p (x k ) \\ < 2p < injM. Therefore, the 
conditions of Proposition 12 .91 arc satisfied, and we could drop the adverb continuously in the statement 
of the conjecture. It is interesting to note that, this would have not been the case if we had the condition 
t e (0, ) (which is enough for reduction at each iteration but not enough for continuous stay 

in B(o,p)). As another example, in Theorem 14. 1[ since we allow t s (0, -n — ) we do not have the 

-"-B(o,p) 7 p 

equivalence between "staying" and "continuously staying." 

Remark 2.19 (Gradient vs. Newton's). According to this conjecture the domain of convergence of 
gradient descent for finding the center of mass is relatively large, in the sense that the algorithm can 
start from any point in the same ball that contains the data points, and moreover, the radius of the ball 
is not restricted more than what the existence and uniqueness Theorem l2. 61 requires. In comparison, if 
we consider implementing Newton's method for finding the center of mass (as for example in |15j ). we do 
not expect to have such a large domain of convergence. We know that, in general, the gradient descent 
method has a larger convergence domain compared with Newton's method, but still one might have 
slight hope that one could prove such a large domain of convergence for Newton's method applied to 
our problem. However, as mentioned in Subsection 12. 1.41 if A > and ^== < p < r cx , then the Hessian 
of f p might be in-definite at some points in B(o,p). This fact, ruins our hope for having Newton's 
method with a domain of convergence as large as the gradient method, since Newton's method involves 
inverting (in an appropriate sense) the Hessian operator (see |15j . [T], or [37] for more details). In fact, 
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this suggests that, in a manifold of positive curvature, even implementing Newton's method for data 
points which are spread in a ball of radius larger than might be impossible or very difficult. In 
fact, results in [15] (in which the estimated domain of convergence for Newton's method lie in balls of 
radius smaller than 4 J^ ) also corroborate this observation. 

2.3. Prior work. There are not many accurate and correctly proven [^results available about the 
convergence of gradient descent (and more specifically constant-step-size) for locating the Riemannian 
center of mass. For constant step-size algorithms, the most accurate and useful results are due to Le 
[2"5] and Groisser [T5J HH] f° r the L 2 mean. Both authors show that if the data points are concentrated 
enough then the map x e~xjp x (-V f 2 (x)), when restricted to a small neighborhood of x%, is a contrac- 
tion mapping (with fixed point x 2 ); and using this they prove a convergence result. Groisser's results 
are more general and allow to analyze both the Newton's method and the gradient descent method, 
while Le's result gives a slightly better bound on the allowable spread of the data points. Le shows 
that when M is a locally symmetric manifold of nonnegative sectional curvature and p < -^r cx , then 
with step-size tf. = 1 and starting from the center of the ball B(o,p), Algorithm [1] locates the global 
Riemannian mean (see Corollary 2 in [25]). In fact, Le shows that the map x exp x (-\7 f 2 (x)) is a 
contraction mapping when restricted to B(x 2 ,p). Since x 2 is a fixed point of the map, starting from o 
the iterates will not leave B(x 2 ,p) (but not B(o,p), necessarily). Le's result leaves room for significant 
improvement in the allowable spread of data points compared to our Conjecture [2TT5J Notice that Le's 
result can be used to deduce convergence for an arbitrary initial condition in B(o,p) assuming a p 
half as before, that is p < ^j^cx- This is obviously a more practical scenario. Our Theorem 14. II (which 
needs only p < \r cyi and does not require local symmetry or nonnegative curvature) is a considerable 
improvement over Le's result. Still our Theorem 13. 61 which is the best one can hope for in the case of 
manifolds of constant nonnegative curvature, is an even further improvement over Le's result (when 
applied to these manifolds). 

In [26j a convergence result is given for a (hard-to-implement) gradient method which varies the 
step-size in order to confine the iterates to a small ball. A result in |23j is somewhat similar in nature 
to our Theorem 14.11 yet it does not yield an explicit convergence condition. Local convergence of 
Algorithm [T] with tk = 1 on §" under the generic condition of u x° being close enough to the center" is 
argued in [9]; however, such a condition is of little practical use. A few authors have also studied other 
related problems and methods e.g., stochastic gradient methods [4], projected gradient methods [24], 
Newton's method [9] and |15j . and variable step-size gradient algorithm for the L 1 mean or median 

EH. 



3. Convergence on Manifolds of Constant Nonnegative Curvature (An Optimal 

Result) 

In this section, we prove Theorem l3. 61 which is essentially Conj ecture 12 . 1 5 1 for a manifold of constant 
nonnegative curvatureJ3 

3.1. A useful triangle secant comparison result. Here, we prove the comparison result used 
to prove Theorem 13.61 (see also Figure [2] and Theorem I3.3|) . Referring to Figure [21 the theorem is 
about comparing the lengths of corresponding geodesic secants xm and xrh of two (geodesic) triangles 
^Vixy2 and ^y\xy 2 which are in M = § 2 and R 2 , respectively, and in which the angles at x and x 
and their corresponding sides are equal. The bi-sectors of angles ^-y\xy 2 and ^y\xy 2 are examples of 
such secants and the theorem implies that the bi-sector of ^y\xy 2 is larger than that of ^y\xy 2 . In 
general, it should be obvious that such a comparison is meaningful only if M is a manifold of constant 
(sectional) curvature or if M is two-dimensional. In the case of a constant curvature manifold M 



A mistake made by some authors (see e.g., 1271 and |130 in proving such results has been to wrongly assume that 
f p : M -* M is globally convex (or strictly convex) if the data points are in a small enough ball, which in general is not 
true, e.g., if M is compact (see Theorem 12. 2|l . 

^To be accurate, Theorem 13.61 is little bit more than Conjecture 12.151 as it also states a result about the iterates 
getting trapped in the convex hull of the data points. 
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the enabling property is the well-known axiom of plane: Let x e M and assume that W c T X M is a 
fc-dimcnsional subspace oiT x M, then the set exp x (W n B(0 X , p)) is a totally geodesic submanifold of 
W. Here B(O x ,p) is the open ball of radius p around the origin of T X M and < p < injAf (see e.g., 
[32j p. 136]). 

It seems that this kind of comparison cannot be deduced, at least immediately, from standard 
Toponogov's comparison theorems (see e.g., [TT], [32], [E], and [2D]). Although we are almost certain 
that this result has been known before, we were not able to find either a proof for it or even its 
statement. Our proof here is a direct one and is not in the spirit of standard comparison theorems 
based on comparison of solutions of two differential equations. 




X X 



Figure 2. &yixy2 is a triangle in a manifold of constant positive curvature and 
A j/i5y2 is the corresponding triangle in Euclidean space. Corresponding equal angles 
and sides are marked. According to Theorem 13.11 the geodesic secant xm is longer 
than the secant xm. 



Theorem 3.1. Consider a triangle with vertices x,yi,y2 € M = § 2 (or S n ) and with minimal geodesic 
sides xyi, xy2, J/1J/2 of lengths b, c, and a respectively, where a + b + c < 2tt. Assume that the internal 
angle ^yixyi is equal to a. (0 < a <tt). Consider another triangle in M. 2 (orW 1 ) with vertices x,yi,y2 
and corresponding side lengths a and b such that b = b, c = c and ^y~\xy2 = at. Consider a geodesic 
secant 7 in triangle y\xy2 passing through x making angles ot\ and U2 (ct\ + ct2 = a) with minimal 
geodesic sides xy\ and xy2, respectively. Denote by m the point where the geodesic secant meets the 
minimal geodesic side y\y2 for the first timeW\Similarlv, let a secant line of triangle y\xy2 passing 
through x make angles ct\ and 0(2 with sides xy\ and xjj2, respectively, and denote by m the point 
where this secant line meets the side j/ij/2- Then we have that the length of the secant segment xm is 
larger than or equal to the length of the secant segment xm with equality if and only if ol\ = 0, 02 = 0, 
6=0, c = 0, or a = tt. More generally, if M is a manifold of constant curvature A > 0, the same result 
holds if 2;, 2/2 belong to a ball of radius not larger than r cx . 

Proof. Denote the length of the geodesic secant segment xm by z(b, c; oi, 012). Using spherical trigono- 
metric identities (e.g., [28j p. 53]) one can show that (see also [28l p. 55]) 

cot b sin 02 + cot c sin oi 



(3.1) cot z(b, c; ai, 0.2) 
Similarly, denote the length of the secant segn 

(3.2) 2(6, c; 0-1,0-2) 



sin(ai + 012) 

Similarly, denote the length of the secant segment xm by 0,01,02)- It is easy to see that 

be sin(oi + 02) 



frsinoi + csino2 

where in both relations 01 + 02 = o. The claim is obvious for 01 = 0, 02 = 0, b = 0, or c = 0. Also from 
b + c + a < 2tt one can show that a = tt only if a = b + c in which case again we have equality. Note 
that z and z are both smaller than tt and therefore to show z(b,c; 01,02) > z(6,c;oi,02) we could 
show cot z{b, c; oi, 02) < cot z{b, c; Oi, 02) with oi + 02 = o (see (|3.ip and (|3.2[) ). First, we note that 



^Both a = ct\ + Q2 and the secant meeting the minimal geodesic side yiy2 follow from the axiom of plane. 
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t ^ = c °t \ is strictly concave in the interval (-,00). To see this notice that g is smooth and its 
second derivative is 

o 1 rot — 

(3.3) 9 » W = __(i + cot 2_)(i__± ); 

which is strictly negative. Note that in the above, the last term is positive in (•=, 00) because tcott < 1 
in (0,tt). Now set tx = hh = 1 , Ai = . , sin(a '; i . ) — and A 2 = ■ , sin ?! . = 1 - Ai. Notice that 

\ t > b 1 ^ c 1 1 sin(Q-ftiJ+sinai ^ sm(a-Qi)+sinQi A 

h,h, Aiti + A 2 i 2 > - (because < 6, c < n). From Ai(/(ii) + A 2 g(i 2 ) ^ 3(Ai£i + A 2 i 2 ) in which equality 
is only if t\ = ti (or equivalently if b = c) we have 



sin(a - ai) cotfe + sinai cote / be (sin(a - a±) + sin a 

(3.4) . 11 <cot \ \ )' 

sm(a - ai) + smcti \ osinai + csm(a - a 



*1\ 

1) r 



It is easy to verify that cott < scot (ts) for < t < n and < s < 1 with equality only if s = 1. Without 
loss of generality we can assume < ai < %, and hence for s = — s "\° . , we have < s < 1 with 

J 1 2 1 sin(Q-aiJ+sinai ' 

s = 1 only if et\ = 0. Consequently by applying the mentioned fact to the right hand side of (|3.4[) with 
the given s we have (assuming ct\ > 0) 

. . sin(a - ai) cot6 + sincti cote , be sin a s s'ma 

( 3 - 5 ) ■ / * : " < cot (r~- \) —< n : ' 

sm(a - ol\) + smai osinai + csm(a - ol\) sin(a - ax) + sinai 

which means that z{b, c, ax, a - ax) > z(b, c, ax, a - ai). □ 

3.2. Riemannian convex combinations. Intuitively, one would like to think of exp x (Y,iLx w i ^P^ 1 x i) 
as a Riemannian convex combination with similar properties as the Euclidean convex combination. We 
call this a convex combination of {xi\f =1 c M with respect to x e M and with weights (wx, ■ ■ ■ ,ifjv)- 
For the case of a manifold of constant nonnegative curvature we show that this is a valid definition. 
The following general proposition is very useful in making sense of Riemannian convex combination 
as well as proving Theorem 13.61 

Proposition 3.2. Let S c AI be a strongly convex set containing {xi}^ =1 and let x e S. Assume that for 
arbitrary weights < Wx,W2 < 1 (with Wx+w^ = 1) and for any 2/1,1/2 e S, cxp x (t(wx exp^ 1 yx + W2 exp" 1 J/2 
5* for t € [0, 1]. Then for every set of points {xi}f =l c S and corresponding weights < u>, < 1 (with 
T,i w i = ljj ex Px(^ ££1 w i exPx* x i) a ^ so belongs to S forte [0,1]. 

Proof. We prove the claim for N = 3 and for larger N it follows by induction. Let y(t) = exp x {tY^=x w i ex P 
Note that we can write 



(3.6) y(t) = exp ; 



/ / -1 / \f w 2 -1 "^3 -1 \\\ 

\t\wxex\o x Xx + {i--wx)( cx Px x 2+ ex Px x 3,)ll 

\ \ W2 + W 3 W2 + W3 ) ) 



Since by assumption xi = exp a .( W ™+ W3 exp^. 1 X2 + w ™+ w ^Pa; 1 ^3) belongs to S, there exists a unique 
minimizing geodesic between x and i 2 - Therefore, exp" 1 i 2 is well-defined and belongs to the in- 
jectivity domain of exp x and we have exp^ 1 i 2 = W ™ + 2 W3 e^^ 1 x 2 + w ™+ w exp~ X3. On the hand, by 
our assumption exp x (t(wx exp^ 1 xx + (1 - Wx) exp" 1 x 2 )) belongs to S, which means that y(t) 6 S for 
£ € [0, 1]. □ 

An example of a set S is the convex hull of {xi}f =1 . Recall that the convex hull of A c M (if it 
exists) is defined as the smallest strongly set containing A. If A lies in a strongly convex set obviously 
its convex hull exists. It is known that the convex hull of a finite set of points in a constant curvature 
manifold is a closed set. Also in a manifold of constant curvature the L p center of mass (1 < p < 00) of 
{xi}f =1 with weights Wi > belongs to the convex hull of {x{\f =1 and if Wi > for every i, it belongs to 
the interior of the hull [3]. 



17 Notice that if M = R n , cxp x (Y,fLi w i ex Px 1 x i) translates to 1 + T.iii w i( x i ~ x )> which is independent of x. In 
a nonlinear space the convex combination does not enjoy this base-point independence and that is why we have been 
explicit in calling exp^ ( T,iLi Wi ^p^ 1 Xi) a convex combination with respect to x. Moreover, to have "nice" properties, 
x cannot be arbitrary and must belong the convex hull of {xi}^ =l , as explained in Remark 13.41 
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The next theorem makes exact sense of an intuitive notion of Ricmannian convex combination in a 
manifold of constant nonnegative curvature: 

Theorem 3.3. Let M be a Riemannian manifold of constant non-negative curvature and S a strongly 
convex set containing {xi\f =1 . Assume that S lies in a ball of radius of at most r cx . For every 
x e S and arbitrary weights Wi > (with Y.i=i w i ~ ^-)> we have exp x (tY,^ =1 Wi exp^ 1 Xi) e S for t e 
[0,1]. In particular, if S is the convex hull of {xi}f =l , then the Riemannian convex combination 
exp x (Y,i = i Wi exp" 1 Xi) belongs to S for every x e S. 

Proof. By Proposition I3.2[ it suffices to show that for arbitrary yi,y 2 e S and weights (wi,W2) with 
wi + w 2 = 1 we have m(t) = exp x (t(w\ exp' 1 yi + u>2 exp' 1 y 2 )) e S for e [0,1]. This follows from 
comparison Theorem 13.11 To see this, first note that, in triangle xy\y 2 in Figure [2 there is a 1 - 1 
correspondence between the weight pairs (w\,w 2 ) and the angle pairs (a\,a 2 ), where ot\ + a 2 = a (the 
axiom of plane is essential for this fact to hold). If x,yi,y 2 e S, then by strong convexity of S we 
have to e S. Since the distance between x and to(1) is nothing but the length of secant xfh in triangle 
xyiy 2 , it follows from Theorem 13.11 that rh(t) must belong to S for t e [0, 1]. □ 

Remark 3.4. There is a subtle difference between the convex combination in Euclidean space and in a 
manifold of constant nonnegative curvature. In Euclidean space exp x (Y,iii Wi exp x x xi) belongs to the 
convex hull of {xi}^ =l for an arbitrary x (i.e., inside or outside the convex hull). However, it follows 
from Theorem 13.11 that . in the case of a manifold of constant nonnegative curvature, if x is not in the 
convex hull of {xi}f[ 1 , then exp x (Y,iii Wi exp^ 1 Xi) also does not belong to the convex hull, necessarily. 

Remark 3.5. One might wonder in what directions the above theorem can be extended. One can 
show that in a manifold of constant negative curvature the inequality in Theorem 13.11 holds in the 
reverse direction, that is, the secant in the manifold is shorter than the corresponding secant in W 1 . 
This by itself implies that, in a manifold of constant negative curvature, exp x (Y,iLi u>i exp" 1 Xi) does 
not belong to the convex hull of {xi}f =1 , necessarily; and one needs to scale down the tangent vector 
HiLi Wi exp" 1 Xi to ensure it belong to the convex hull. This scaling somehow should be related to the 
size of the convex hull or the minimal ball of {xi}^ =1 (see Conjecture 12. 15l and Rcmark l3.8[) . Recall that 
the minimal ball of {xi}f =1 c M is a closed ball of minimum radius containing {xi}f =1 (see [3] for more 
on the minimal ball). Furthermore, even for nonnegative variable curvature, exp x {YJi=i Wi exp^ 1 xi) 
belonging to the convex hull of {xi}^ =1 seems implausible. However, in this case, we conjecture that 
exp x (Y,iii Wi exp^ 1 Xi) belongs to the minimal ball of {xi}f =l . Also another curious question whose 
answer is most likely "yes" is: "In a manifold of constant nonnegative curvature, can one generate the 
entire convex hull of {xi}f =l by varying the weights in exp x {Yd=i ifj exp" 1 x^), where x belongs to the 
convex hull?" 

3.3. Convergence result. We are now ready to state and prove the main theorem of the section. 

Theorem 3.6. Assume that M is a complete Riemannian manifold with constant nonnegative sectional 
curvature A = <5 > 0. Let p = 2 and {xi}^ c B(o 7 p), where p < r cx (see 112.1)) ). In Algorithm^ choose 
an initial point x° e B(o,p) and a constant step-size = t, where t e (0,1]. Then we have: The 
algorithm is well-defined for every k>0, each iterate continuously stays in B(o,p), f 2 (x k+1 ) < f 2 (x k ) 
with equality only if Xk = x 2 , and x k -* x 2 as k -* oo. Moreover, if for some k! > 0, x k belongs to the 
convex hull of {xi]f =1 , then x k also belongs to the convex for k > k' . More generally, for 2 < p < oo the 
same results hold if we take t e (0,t p . p ] with t PyP = H — - — where Hb(o.p),p = (p ~ l)(2/o) P • 

Proof. The fact that each iterate continuously stays in B(o,p) follows from Theorem 13.31 The same 
argument shows that if x k is in the convex hull of {xj}^, then x k also belongs to the hull for k > k' . 
By Proposition 12.71 step-size tk = t at each step results in strict reduction of f 2 unless at x 2 . Next, 
the iterates converging to x 2 follows from Theorem 12.101 by taking S as B(o,p) or the convex hull of 
{xi} 1 ii 1 . For the general p, we notice that „ "- 1 — V fp(x) can be written as T,i=i Wi exp" 1 Xi, where 

HiLi Wi < 1 and Wi > 0. Therefore, again we can use Theorem 13. 3[ and the rest of the claims follow 
similarly. □ 
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Remark 3.7. In a manifold of non-constant curvature, Theorem 13.11 is not meaningful; therefore, a 
direct application of this theorem to prove Conjecture 12.151 for an arbitrary manifold of nonnegative 
curvature is not plausible. However, it is likely that comparison Theorem 13.11 might be used to prove 
a useful comparison theorem for the case of manifold of variable nonnegative curvature, which in turn 
could allow us to apply Proposition 13.21 

Remark 3.8. In [TS], Groisser introduced the notion of tethering: A map $ : M -> M is called tethered 
to {xi}f[ 1 if for every strongly convex regular geodesic ball B containing {xi}^i, $ is defined on 
B and &(B) c B. To avoid technical difficulties which probably have little to do with the essence 
of the property of tethering, we replace "every strongly convex regular geodesic ball" with "every 
ball of radius less than or equal to r cx ." Groisser's definition is more general than ours. Groisser 
conjectured that tethering "might occur fairly generally." Several results in |15j can be strengthen 
if tethering assumption holds (even in this new sense). In the above theorem, we proved that for 
t e [0,1], the map x i-» exp x (-tV/2(^)) is tethered to {x{\f =l in manifolds of constant nonnegative 
curvature. We conjecture that the same holds for manifolds of nonnegative variable curvature. Based 
on the discussion in Remark l3.5[ we conjecture that tethering in manifolds of negative curvature docs 
not hold. As mentioned in Remark 13.51 (and also expressed in Conjecture I2.15p . we conjecture that 
in order for x *-* exp a ,(— £Vf 2 (x)) to map B(o,p) □ {xi}f =1 to itself, t should be smaller than 1. More 
specifically, we conjecture that t cannot be independent of p, and t e [0, Cs ( 2p ) ] sum ces. 

4. Convergence results for manifolds of arbitrary curvature 

Here, we prove two classes of results which are sub-optimal compared to Conjecture 12.151 In the 
first class the spread of data points is compromised to guarantee convergence. In the second class, the 
step-size is restricted more than what is needed to reduce the cost at each iteration to ensure that the 
iterates do not leave a neighborhood in which x 2 the only zero of V/2O 

4.1. Compromising the spread of data points. Here, by compromising the allowable spread (or 
support) of the data points we give a condition which guarantees convergence of Algorithm [T] to the 
global center of mass. 

Theorem 4.1. Setp= 2 and let x 2 be the L 2 center of mass of {xi}f =l c B(o,p) c M where p< ^r cx . 
Define ts, p = — where Hb(o.3p) = c s(^p) and c K is defined in 12. 3\) . In Algorithm^ assume 

that x° € B(o,p) and for every k > choose tk = t, where t e (0,2ts lP ). Then we have the following: 
The algorithm is well-defined for all k > and each iterate of the algorithm continuously stays in 
B(o,3p), f-i{x k+1 ) < f-i{x k ) for k>0 (with equality only if x k is the Riemannian center of {xi}f =1 ), 
and x k -* X2 as k -*■ 00. Moreover, if x° coincides with x° , then p < \r cyi is enough to guarantee 
the convergence, in which case each iterate of the algorithm continuously stays in B(o,2p) and we 

can take tg p = jj— where -ffe(o,2p) = Cg(3p). More generally, for 2 < p < 00 the same results 

hold if we replace H B ( o 3p j and H B ( o 2p ), respectively, with H B ^ o 3p ^ p = {Ap) p ~ 2 max{p - 1, ca(4p)} and 
H B (o,2p),p = (3p) p ~ 2 max{p- l,c s (3p)}. 

Proof. For any x e M\B(o, Bp) we have f 2 (x) > 2p 2 > f 2 (x°) (sec (|27f|)). From (|2"3j) and (|2~3|) and that 
{xi}^ c B(o,p), one sees that ffs(o,3p) = c s(^p) is an upper bound on the eigenvalues of the Hessian of 
f 2 in B(o, 3p). Moreover, by Proposition l2.7[ for small enough t e (0, 2tg tP ), s cxp a , (-sV/2(x )) docs 
not leave B(o, ip) for s 6 [0,i], and we have /(exp^-iV/^^ )) < f(x°), with equality only if x° is the 
unique zero of V/2 in B(o, 3p). However, s 1-* exp^o (-sV/2(a; )) must lie in B(o, p) for all s in (0, 2ts, P ), 
since on the boundary of B(o,3p), f is larger than /(x°) and by continuity s h> cxp^-sV/^a; )) 
cannot leave B(o, 3p) without making f 2 (exp x o (-5X7/2(2;°))) larger than f2(x°) inside B(o, 3p), which 
is a contradiction. Therefore, for any t e (0, 2tjp), the iterate a; 1 = exp^^tV f 2 (x )) continuously 



-^The reader can convince herself or himself that Conjecture 12.151 not only gives a smaller best a-priori q than both 
Theorems l4.1l and l4.2l but also it gives a smaller asymptotic q (see Remark l2.14H when respective best a-priori step-sizes 
are employed. Notice that the asymptotic q is data dependent. 
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stays in B(o,3p) and f2(x 1 ) < f2(% ), with equality only if x° = X2- A similar argument shows that 
for any y e B(o,3p) such that f 2 (y) < M^ ), f(exp y (-sVf 2 (y))) for s e [0,t] belongs to B(o,3p) 
and /(exp (— tV/2(j/))) ^ f(y) with equality only if y = X2- In particular, assuming x k e B(o,3p) and 
f2{x k ) < f 2 (x°), by setting y = x k e B(o,3p), we conclude that continuously stays in B(o,3p) 
and f(x k+1 ) < f(x k ) with equality only if x fc = X2- Note that for any point y in B(o,3p) \ B(o,p) we 
have d(y,Xi) < 4p < finjA/ for 1 < i < n; therefore, V f2{v) in fl2.8[) and hence (|2.9p in Algorithm Q] are 
well-defined. Next, by taking B(o,3p) as S in Theorem 12. 101 we conclude that i' -» i 2 as ^ oo. To 
see the claim about B(o,2p), note that if x° coincides with o, then we have f2(x) > \p 2 > f2(%°) for 
any x out of B(o,2p) and the derived conclusions hold with B(o,2p). The claims about 2 < p < oo 
follow similarly by further using (|2.6j) . □ 



The radius 3p above was found based on the simple observation that f p takes larger values outside 
of B(o,3p) than inside of B(o,p). Finer results should not be difficult to prove. 

4.2. Compromising the step-size. Here, we briefly describe another approach in which the step- 
size is further restricted to ensure that the iterates do not leave a ball larger than B(o,p)- and not 
B(o,p) itself. A rather similar idea has been used in [J] and [35] i an d here we partially follow the 
methodology in |38j . Specifically, given p and p' where p < p' < r cx and assuming the data points lie in 
B(o,p), by restricting the step-size we want to make sure that, starting from B(o,p'), the iterates do 
not leave the larger ball B(o,p'). 

For x inside B(o,p), let t x > denote the first time t h> r y x (t) = exp x (-f\7 f 2 (x)) hits the boundary 
of B(o 7 p'). Note that sup B ( o p ) ||V/2(x)|| < 2p, therefore we must have t x > tf x where 

^ 4 ^ t i a = ' mi xeB(o, p),yeM-,B(o,p>) d(x, y) _ p' - p 



2p 2p 

Similarly, for y in the annular region between B(o,p') and B(o,p), let t y > denote the first time 
t h> 7 y (i) = exp y (-tV/2(y)) hits the boundary of B(o 7 p'). For t h> ^d 2 (o, / -f y (t)) one writes the second 
order Taylor's series expansion in the interval [0,iy] as: 

(4-2) \d\o, ly (t y )) = \y 2 =\d\o, y ) + {-Vf2{y),-^V y 1 o)t y + ^ 

where s is in the interval (0,t y ). Next, using (|2.5p and noting that p 2 - d 2 {o,y) > we verify that 
,, 2(-V/ 2 (y), exp^o) 

( 6) v cs(p') 

where c$ is defined in (|2.3[) . Denote by ^-Xiyo the angle, at y, between the minimal geodesies from y 
to and from y to o. It is shown in Lemma 10 in }38| that 

sn A (d(y,o) - p) 



(4.4) cos +Xiyo\ 



sn A (d(y,o) + p) ' 



where snA is defined in (|2.2|) . Using this and observing that | V/2 (y) || > d(y, o) - p we have t y > t° u 
where 

(4.5) c m = >< ^, o) >< (d(y, o)-p)* snA ifr\: p) Y 

cs(p') sn A (d(y,o) +p) 

Also observe that (trivially) we must have t y > t° nt ' 2 , where 

(4.6) t° 



t oux,2 = p'-d(y,o) 
p+d(y,o) 



Obviously, t y must satisfy t y > maxjt™*' 1 , t° ut ' 2 }. Define 



(4.7) t exit = min{t m , inf max{i° u ' , t on ' }} 

y.p<d{y,o)<p' 
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where t [n , t° ut,:L , and t° ut ' 2 are denned in (|4.1j) . (|4.5|) . and (|4.6|) . respectively, with the assumption 
p< p' < r cx . We see that for any z e B(o,p') and any i e [0,i e xit], exp z (-^V/2(^)) belongs to B(o 7 p'). 
Notice that t = t e xit is indeed acceptable. Also observe that i e xit is larger than zero; since otherwise it 
can be zero only if for z in the region B(o,p') \ B(o,p) and very close to the boundaries of the region 
t° uttl and t° ut ' 2 both become arbitrary close to zero, which obviously cannot happen. Based on this 
analysis we have the following theorem. 

Theorem 4.2. Let p = 2, {xi}f =1 c B(o,p) and assume p < p' < r cx . Define H B ( a p /) = cs(p' + p) and 

set 

(4-8) tsA.p.p' = min{t cxit , — ^ }, 

n B(o,p') 

where £ cx it is defined in j^.7[ ). In Algorithm^ choose an initial condition x° e B(o,p) and step-size 
tf~ = t, where t e (0, 2t\ A()() ,)n[0, t cxlt ] . Then we have the following: The algorithm is well-defined for 
every k>0, each iterate continuously stays in B(o,p'), f2{x k+1 ) < f2(x k ) with equality only if x k = Xi, 
and x k -*■ x~2 as k -> oo . 

Proof. The fact that each iterate continuously stays in B(o,p') follows from preceding arguments. 
From this it we see that d(x k ,Xi) < injM for every k > and 1 < i < N, and hence the algorithm is 
well-defined for k > 0. The rest of the claims follow from Theorem 12. 101 □ 

Next, we give some numerical examples about the interplay between p, p' , and the step-sizes ac- 
cording to Theorem 14.21 and compare that with step-size and allowable spread from Conjecture 12.151 
and Theorem 14.11 First, let 5 = and A > and let p' = r cx . To have tf. = 1 we need to have 
p < fx ~ 0.0303r cx , while Theorem 14.11 gives much larger p, i.e., p < ^r cx . We can increase p and 
further restrict the step-size: If we set p = ^r cx , then we get t* s A , as 0.3965, if p = y^cx we get 
t* s A , = 0.0353, and finally when p = 0.99r cx we get t* s A , = 0.0033, all of which are considerably 
smaller than the optimal step-size of 1 in Conjecture 12.151 Yet, the added value is that we have con- 
vergence for more spread-out data points (i.e., going from p < ^r cx to almost p < r cx ). Next, let 5 < 0, 
A = 0, and p' - (this is just an arbitrary number). To get the optimal step-size in Theorem 

12.131 which is g 1 - and is equal to cs ^ p \ p ,^ , we need p < r 2 ~ 0.1950p'. Therefore, the t* s A p p , from 

Theorem 14.21 cannot be larger than the tg tP from Theorem 14.11 In fact, if set p = ^p' , then we need 
t* s A p p , = 0.3022 according to Theorem |4~2| while we have t$, p = 0.4632 from Theorem |4~T1 

Finer analysis could yield a larger estimate for the exit time than (|4.7|) . However, since cg(2p) is a 
upper bound on the eigenvalues of the Hessian of f% in B(o,p), such an improvement will not result 
in an optimal step-size better than tk = ctg(2p)^ 1 (cf. (|4.8[) and Coniecture l2.15l) . 

5. ON THE CONFIGURATION OF DATA POINTS AND THE LOCAL RATE OF CONVERGENCE 

In this section, we give a qualitative answer to the following question: "For which configurations of 
data points Algorithm[T]locates the center of mass very fast? very slowly?" We use the facts mentioned 
in Subsection 12 . 1 .61 to answer this question. 

We assume p = 2. From Theorem 12.131 and the definition of q in (|2.14j) it is clear that in addition to 
a the ratio is also important in determining the speed of convergence, and the asymptotic speed of 
convergence depends on the ratio in a very small neighborhood S around x. Obviously, the smaller 
the ratio is, the slower the convergence will be, and vice versa. In the Euclidean case the ratio is 1 
and with a = 1 we have q = 0; therefore, Algorithm [1] finds the center of mass in one step (see (|2.13l) 
and (|2.14jl ). However, in a curved manifold the ratio -g^ can be very small, due to drastic difference 
in the behavior of the Hessian of the distance function along different directions. Next we give simple 
examples that demonstrate this fact. 

We consider the case of constant curvature since in this case the eigenvalues of the Hessian of the 
distance function are the same along all directions but the radial direction. Furthermore, let us assume 



'in fact, according to the derivations, one could choose x° e B(o,p'). 
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M is a 2-dimcnsional simply connected manifolds with constant curvature, that is M = §^ where A = 1 
or A = -1 (with the convention § 2 = § 2 )- We construct two simple configurations for which Algorithm 
[T] converges very fast and very slowly, respectively. Consider four data points {xi}f =1 and the closed 
ball B(o,p) c M, where p < r cx . Assume x\ and x-i are on the boundary of the ball in antipodal 
positions and that x% and x^ are also in antipodal positions such that the geodesic 7 o;cl from o to 
x\ and the geodesic 70,23 from o to X3 are perpendicular at o. We denote this configuration by •*•. 
Obviously, x = o is the center of mass of {xi}f =1 with equal weights. It is easy to verify that for the •*• 
configuration the Hessian of f 2 at x = o both along "f x x and along 7 OX3 has eigenvalue \(pct&(p) + 1). 
Consequently, at o the ratio of the smallest and largest eigenvalue is 1, hence jf- « 1 around x = o; and 
therefore, one expects that the local rate of convergence will be very fast. The opposite configuration 
is *, that is, when 2; 3 and X4 coincide with x\ and x 2 , respectively. In this case, at x = o along r ) OXl 
the Hessian of f 2 has eigenvalue of 1 and in the perpendicular direction it has eigenvalue pct/±{p). 
Therefore, if A = 1, we have « pcotp around o which, in particular, can be very small if p is close 

to ^. If A = -1, we have « (pcothp) -1 , which again can be small if p is large. It is well known 
that the shape of the level sets of a function in a neighborhood of a minimizer is related to the ratio 
-g 2 - . If the level sets are very elongated or thin, this means that the Hessian has very small eigenvalues 

along longitudinal directions and very large eigenvalues along the lateral directions and hence -jf- can 
be very small. For our two configurations, we encourage the reader to compare the shapes of the level 
sets of f 2 in B(o,p) c S 2 for levels close to f 2 (o) (especially when p is close to |) with the level sets 
of f 2 in B(o,p) c S 2 1 for levels close to /2(o) when p is very large). 

As a tangible example, on the standard unit sphere S 2 we run Algorithm [T] for both the configurations 
with two different values of p « 0.35-7T and p rs 0.47-7T. The initial condition is chosen randomly. The 
step-size is chosen as tk = 1- Figure [3] shows the distance d(x k ,x) in terms of the iteration index 
k. It is clear that for the * configuration the convergence is slower than the convergence for the •*• 
configuration, and as p increases, convergence for both configurations becomes slower. However, for the 
* configuration as p approaches ~, the convergence becomes extremely slow and the •*• configuration 
is much more robust in that sense. Note that when /? ~ -| even the center of mass of the * configuration 
is on the verge of non-uniqueness and this causes further (error) sensitivity and hence poor convergence 
(see [2] on the issue of high noise-sensitivity of the Riemannian mean in positively curved manifolds). 

Although our example is rare in statistical applications, in a more general setting also one expects 
that if the configuration of data points is such that the convex hull of the data points has an elongated 
shape (especially if the length of the convex hull is large), then locating the Riemannian center of 
mass becomes a difficult problem (with the exception of the Euclidean case). Our analysis does not 
tell the whole story in the case of non-constant curvature and we need more detailed analysis that 
takes into account the variability of eigenvalues of the Hessian of the distance function along non-radial 
directions, as well. 

6. Concluding Remarks 

Our goal has been to give and prove the the best possible conditions for convergence of the popular 
constant step-size gradient descent for finding the Riemannian center of mass. We argued that Con- 
jecture gives such best conditions (in some specific yet general sense). The proof of the conjecture 
seems to be difficult, because, in particular, it appears that such a proof requires some very deep 
understanding about the behavior of the exponential map of a Riemannian manifold. We proved the 
conjecture for manifolds of constant nonnegative curvature and our proof was based on comparison 
Theorem 13.11 which seems to give a better estimate of the behavior of the exponential map than what 
(possibly) could be gained by standard comparison theorems. Therefore, extending this comparison 
theorem (in appropriate sense) to manifolds of variable curvature not only could help prove Conjecture 
12.151 but also could provide deeper understanding of the behavior of the exponential map of a general 
manifold. Our Theorems 14.11 and 14.21 (whose proofs arc based on simple observations) give weaker 
convergence conditions, but still these results are considerably better than the available ones. Finer 
analysis could improve the results of these theorems, as well. 
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Convergence for • :• and : configurations with two values of p 

10° | , , , 1 




k = number of steps 

Figure 3. Convergence behavior of Algorithm Q] with step-size ifc = 1 for locating the 
center of mass two data point configurations denoted by •*• and * on the unit sphere 
§ 2 . 
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